Note: Might not be available on all browsers; use Chromium-based or Firefox.
Use rvest to scrape
library(rvest)
library(tidyverse)

# 1. Request & collect raw html
html <- read_html("https://en.wikipedia.org/w/index.php?title=World_Happiness_Report&oldid=1165407285")

# 2. Parse
happy_table <- html |>
  html_elements(".wikitable") |> # select the right element
  html_table() |>                # special function for tables
  pluck(3)                       # select the third table

# 3. No wrangling necessary
happy_table
# A tibble: 153 × 9
`Overall rank` `Country or region` Score `GDP per capita` `Social support`
<int> <chr> <dbl> <dbl> <dbl>
1 1 Finland 7.81 1.28 1.5
2 2 Denmark 7.65 1.33 1.50
3 3 Switzerland 7.56 1.39 1.47
4 4 Iceland 7.50 1.33 1.55
5 5 Norway 7.49 1.42 1.50
6 6 Netherlands 7.45 1.34 1.46
7 7 Sweden 7.35 1.32 1.43
8 8 New Zealand 7.3 1.24 1.49
9 9 Austria 7.29 1.32 1.44
10 10 Luxembourg 7.24 1.54 1.39
# ℹ 143 more rows
# ℹ 4 more variables: `Healthy life expectancy` <dbl>,
# `Freedom to make life choices` <dbl>, Generosity <dbl>,
# `Perceptions of corruption` <dbl>
## Plot relationship between wealth and life expectancy
ggplot(happy_table, aes(x = `GDP per capita`, y = `Healthy life expectancy`)) +
  geom_point() +
  geom_smooth(method = "lm")
# 1. Request & collect raw html
html <- read_html("https://en.wikipedia.org/w/index.php?title=List_of_prime_ministers_of_the_United_Kingdom&oldid=1166167337") # I'm using an older version of the site since someone just changed it

# 2. Parse
pm_table <- html |>
  html_element(".wikitable:contains('List of prime ministers')") |>
  html_table() |>
  as_tibble(.name_repair = "unique") |>
  filter(!duplicated(`Prime ministerOffice(Lifespan)`))

# 3. No wrangling necessary
pm_table
# A tibble: 75 × 11
Portrait...1 Portrait...2 Prime ministerOffice(Lifespa…¹ `Term of office...4`
<chr> <chr> <chr> <chr>
1 "Portrait" "Portrait" Prime ministerOffice(Lifespan) start
2 "" "" Robert Walpole[27]MP for King… 3 April1721
3 "" "" Spencer Compton[28]1st Earl o… 16 February1742
4 "" "" Henry Pelham[29]MP for Sussex… 27 August1743
5 "" "" Thomas Pelham-Holles[30]1st D… 16 March1754
6 "" "" William Cavendish[31]4th Duke… 16 November1756
7 "" "" Thomas Pelham-Holles[32]1st D… 29 June1757
8 "" "" John Stuart[33]3rd Earl of Bu… 26 May1762
9 "" "" George Grenville[34]MP for Bu… 16 April1763
10 "" "" Charles Watson-Wentworth[35]2… 13 July1765
# ℹ 65 more rows
# ℹ abbreviated name: ¹`Prime ministerOffice(Lifespan)`
# ℹ 7 more variables: `Term of office...5` <chr>, `Term of office...6` <chr>,
# `Mandate[a]` <chr>, `Ministerial offices held as prime minister` <chr>,
# Party <chr>, Government <chr>, MonarchReign <chr>
links <- html |>
  html_elements(".wikitable:contains('List of prime ministers') b a") |>
  html_attr("href")
title <- html |>
  html_elements(".wikitable:contains('List of prime ministers') b a") |>
  html_text()
tibble(name = title, link = links)
# A tibble: 90 × 2
name link
<chr> <chr>
1 Robert Walpole /wiki/Robert_Walpole
2 George I /wiki/George_I_of_Great_Britain
3 George II /wiki/George_II_of_Great_Britain
4 Spencer Compton /wiki/Spencer_Compton,_1st_Earl_of_Wilmington
5 Henry Pelham /wiki/Henry_Pelham
6 Thomas Pelham-Holles /wiki/Thomas_Pelham-Holles,_1st_Duke_of_Newcastle
7 William Cavendish /wiki/William_Cavendish,_4th_Duke_of_Devonshire
8 Thomas Pelham-Holles /wiki/Thomas_Pelham-Holles,_1st_Duke_of_Newcastle
9 George III /wiki/George_III
10 John Stuart /wiki/John_Stuart,_3rd_Earl_of_Bute
# ℹ 80 more rows
Note: these are relative links that need to be combined with https://en.wikipedia.org/ to work.
Exercises 2
For extracting text, rvest has two functions: html_text and html_text2. Explain the difference. You can test your explanation with the example html below.
html <- "<p>This is some text some more text</p><p>A new paragraph!</p> <p>Quick Question, is web scraping: a) fun b) tedious c) I'm not sure yet!</p>" |>
  read_html()
How could you convert the links object so that it contains actual URLs?
How could you add the links we extracted above to pm_table to keep everything together?
Example: Getting content from embedded json
html <- read_html("https://news.sky.com/story/crowdstrike-company-that-caused-global-techno-meltdown-offers-partners-10-vouchers-to-say-sorry-and-they-dont-work-13184488")
data <- html %>%
  rvest::html_element("[type=\"application/ld+json\"]") %>%
  rvest::html_text() %>%
  jsonlite::fromJSON()
datetime <- data$datePublished %>%
  lubridate::as_datetime()
# headline
headline <- data$headline
# author
author <- data$author$name
text <- html %>%
  rvest::html_elements(".sdc-article-body p") %>%
  rvest::html_text2() %>%
  paste(collapse = "\n")
html <- read_html("https://www.zeit.de/mobilitaet/2024-04/deutschlandticket-klimaschutz-oeffentliche-verkehrsmittel-autos-verkehrswende")
html |>
  html_elements(".article-body p") |>
  html_text2()
[1] "Ganz Deutschland fährt Bahn. So fühlte sich das im Sommer 2022 zumindest an, als das 9-Euro-Ticket für drei Monate für überfüllte Züge sorgte. Die Bundesregierung und viele Menschen zeigten sich begeistert: So leicht war es also, Bürgerinnen und Bürger für die umweltfreundlichen öffentlichen Verkehrsmittel zu begeistern, man muss nur ein günstiges Ticket für ganz Deutschland anbieten."
[2] "Aber als die Bundesregierung den Nachfolger vorstellte, waren viele enttäuscht. 49 Euro monatlich kostet das Deutschlandticket und ist nur im Abo erhältlich. Euphorisch war nur noch die Bundesregierung. Doch jetzt, ein Jahr nach dem Start, kann man sagen: zu Recht. Zumindest, was die Fahrgastzahlen angeht."
🤔 Wait, that’s only the first two paragraphs!
💡 Websites use cookies to remember users (including logged in ones)
What are browser cookies?
Cookies are small pieces of data that the web browser stores on the user's device while browsing websites.
Purpose:
Session Management: Maintain user sessions by storing login information and keeping users logged in as they navigate a website.
Personalization: Save user preferences, such as language settings or theme choices, to enhance user experience.
Tracking and Analytics: Track user behavior across websites for analytics and targeted advertising.
We can use them in scraping:
to get content from websites that require consent before giving access
to authenticate as a user with content access privileges
to access personalized content
to simulate real user behavior, reducing the chances of getting blocked by websites with anti-scraping measures
You can use browser extensions like “Get cookies.txt” for Chromium-based browsers or “cookies.txt” for Firefox to save your cookies to a file
Implications:
You need to keep cookies secure as they can authenticate others as you!
Special Requests: Behind Paywall Cookies!
library(cookiemonster)
add_cookies("cookies.txt")
library(httr2) # request(), req_options(), req_perform() and resp_body_html() come from httr2

html <- request("https://www.zeit.de/mobilitaet/2024-04/deutschlandticket-klimaschutz-oeffentliche-verkehrsmittel-autos-verkehrswende") |> # start a request
  req_options(cookie = get_cookies("zeit.de", as = "string")) |> # add cookies to be sent with it
  req_perform() |>
  resp_body_html() # extract html from response

html |>
  html_elements(".article-body p") |>
  html_text2()
Example: South African Parliament (a special case)
In the folder /data (relative to this document) there is a PDF with some text. Read it into R
The PDF has two columns; bring the text into the order in which a human would read it.
Let’s assume you wanted to have this text in a table with one column indicating the section and one containing the text of the section.
Now let’s assume you wanted to parse this on the paragraph level instead
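The steps above can be sketched with the pdftools package. This is only a starting point, and the file name and the column split position are assumptions — inspect the raw output of your PDF first and adjust accordingly:

```r
# Sketch only: "data/example.pdf" and the split position 60 are placeholders
library(pdftools)
library(tidyverse)

raw_text <- pdf_text("data/example.pdf") # character vector, one string per page

# Split each page into lines, then cut each line at a fixed character
# position to separate the two columns (assumes the columns are aligned)
pages <- str_split(raw_text, "\n")

left_col  <- map(pages, \(lines) str_trim(str_sub(lines, 1, 60)))
right_col <- map(pages, \(lines) str_trim(str_sub(lines, 61)))

# Reading order: left column of a page first, then its right column
full_text <- map2(left_col, right_col, c) |>
  unlist() |>
  paste(collapse = "\n")
```

A fixed split position is the simplest approach; pdftools::pdf_data() returns word-level coordinates if you need a more robust column detection.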
Optional Homework
You have now seen some tools and tricks for scraping websites. But your best ally in web scraping is experience! Until tomorrow noon, your task is to find a page on Wikipedia that you find interesting and scrape content from it. Even if you don’t fully succeed, document the steps you take and note down where the information can be found. If you want to try to get some data you actually need from a different website, you’re also welcome to do that. But note that if you collect raw HTML in R and the data is not where it should be (e.g., the HTML elements containing the data do not exist), you might have discovered a more advanced site, which we will cover later. Note that down and try another page.
Deadline: Friday before class
Wrap Up
Save some information about the session for reproducibility.
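A common way to do this is to print the session information at the end of the script, which records the R version and the loaded packages with their versions:

```r
sessionInfo()
# or, for a more compact overview:
sessioninfo::session_info()
```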